README.TXT
Matt Turner, May 8, 2020
********************************

This code generates the data and results for Henderson and Turner (2020).  Unzip the repository. The resulting code has two main components. The first is in the `dofiles' directory. This code does the heavy lifting of organizing GIS data and matching it to survey data. It produces the Stata files that form the basis of our analysis.  The second component is in the directory called `analysis'. It operates on data created in the first stage to generate the results used in the paper. Most of the data, both source and output, is in the directory `data'.

Software and Hardware
********************************
This code was last run on a server running Windows Server 2012. It has 96GB of RAM, but I don't think you need anywhere near that much. It takes about 8 hours to run all the way through on this server.

You will need the following software:
-ARCGIS10.6. You will need the `spatial analyst' extension.
-Stata 15
-StatTransfer 15 with the command line extension 
You will need to make sure that StatTransfer runs from the command line. Type `st' at a shell prompt and you should get queries from StatTransfer.  This usually works automatically after installation, but you may need to adjust the environment variables. The process for doing this is OS specific.
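
If you prefer to check this from a script, the following Python sketch looks for `st' on the search path. It assumes only that the executable is named `st', as above; the helper name find_on_path is made up for this example and is not part of the repository.

```python
import shutil


def find_on_path(command="st"):
    """Return the full path of `command` if the shell can find it, else None."""
    return shutil.which(command)


if __name__ == "__main__":
    location = find_on_path("st")
    if location is None:
        print("st not found: add the StatTransfer directory to PATH")
    else:
        print("StatTransfer found at", location)
```

A result of None means the environment variables still need adjusting.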

You will also want an IDE for Python. I use WingIDE5.1, but I think any IDE will probably do. 


SOURCE DATA AND AVAILABILITY
********************************
To run these programs you will need four data sets described below. In order for us to share these data sets with you we will require:

(1) written authorization from the DHS administrator 
(2) evidence that you have registered with the LSMS, and 
(3) a copy of your authorization to use the confidential Afrobarometer data.

The data sets and application processes are described below.

1). Global Human Settlements (GHS).
------------------------------------------
We use "Community pre-Release of GHS Data Package (GHS CR2018) in support to the GEO Human Planet Initiative".  These data are publicly available. See documentation in data_processing\data\GHS\source.  

These data are not confidential, and the source data are included in the directory data_processing\data\GHS\source\.

2). Demographic and Health Survey (DHS):
------------------------------------------

We rely on the highly processed DHS data that were the basis of the analysis in Henderson et al. (2020). DHS terms of use ( https://dhsprogram.com/data/Terms-of-Use.cfm ) prohibit us from sharing data with you unless you obtain the written consent of the DHS program.

To obtain such consent, you will need to first obtain permission to access the confidential geocoded data and then ask them for a letter authorizing us to share our extract from their data.

To obtain access to the confidential DHS data, 

1) Create an account at https://dhsprogram.com/ using the login link.

2) Create a project within your account.

3) Check off all the countries for which you want data. Make sure you
check "Show GPS data" and check off the GPS data for each country you
want, in addition to the regular data. This is our country list:

Eastern Africa: Burundi, Comoros, Ethiopia, Kenya,
Malawi, Mozambique, Rwanda, Tanzania, Uganda, Zambia, Zimbabwe
Western Africa: Benin, Burkina Faso, Cote d'Ivoire,
Ghana, Guinea, Liberia, Mali, Nigeria, Senegal, Sierra Leone, Togo
Middle Africa: Angola, Cameroon, Chad, DR Congo, Gabon
Southern Africa: Lesotho, Namibia
South Asia: Bangladesh, India, Nepal
South-east Asia: Cambodia, Myanmar, Philippines, Timor-Leste
Latin America and the Caribbean: Colombia, Dominican
Republic, Guatemala, Haiti, Honduras

You should get an email within a day or so saying you are approved. At
that point please send the DHS administrator the following letter:

****
Dear DHS Archive,

I am writing to request your permission for Matthew Turner to send to me the replication data files for the following paper:

Henderson, J. Vernon and Matthew A. Turner. 2020. Urbanization in the developing world: too early or too slow? Journal of Economic Perspectives.

I have received permission to download the underlying DHS data (see attached).
***
We need an electronic copy of your letter and their affirmative response.

We will send you a zip file containing a subset of the /data directory. Unzip, and copy source files into the corresponding folders.   This directory contains the following DHS files: 

	Directory of S:\JEP_with_JVH\data_processing\data\DHS\source\

	deliverables/birth.dta
	deliverables/children.dta
	deliverables/definitions.xlsx
	deliverables/female.dta
	deliverables/hh.dta
	deliverables/hhmember_lifestyle.dta
	deliverables/hhmember_school.dta
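
After copying, you may want to confirm that every file landed where the programs expect it. A minimal Python sketch follows; the file names are copied from the listing above, while the helper missing_files and the way you pass in the source directory are assumptions made for this example.

```python
import os

# File names copied from the directory listing above.
DHS_FILES = (
    "deliverables/birth.dta",
    "deliverables/children.dta",
    "deliverables/definitions.xlsx",
    "deliverables/female.dta",
    "deliverables/hh.dta",
    "deliverables/hhmember_lifestyle.dta",
    "deliverables/hhmember_school.dta",
)


def missing_files(source_dir, expected=DHS_FILES):
    """Return the expected files that are not present under source_dir."""
    missing = []
    for name in expected:
        # Split on "/" so the check also works with Windows path separators.
        path = os.path.join(source_dir, *name.split("/"))
        if not os.path.isfile(path):
            missing.append(name)
    return missing
```

Call missing_files with the path to your copy of data\DHS\source; an empty list means all seven files are in place.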

3). Living Standards Measurement Study (LSMS):
------------------------------------------

This is survey data generated by a World Bank program of the same name, http://surveys.worldbank.org/lsms/about-lsms

The Living Standards Measurement Study (LSMS) surveys are publicly available on the World Bank Microdata Library. One must register at https://microdata.worldbank.org/ to download and use the data.

We use a highly processed extract of these data that was the basis of Henderson et al. (2020). The particular LSMS surveys on which our extract is based are: Tanzania Panel Household Survey (2008 and 2010), Nigeria National Household Survey (2010 and 2012),  Uganda National Panel Survey (2009, 2010, 2011, and 2012),  Ethiopia Socioeconomic Survey (2011, 2013, and 2015), Malawi Integrated Household Survey (2010 and 2013), and Ghana Socioeconomic Panel Survey (2010 and 2013).

Once you register with LSMS we will share two data files:

LSMS_IND.dta
LSMS_HH.dta

Copy them to:

S:\JEP_with_JVH\data_processing\data\LSMS\source

4). Afrobarometer: 
------------------------------------------

Afrobarometer is confidential survey data. To apply for access to the Afrobarometer data, go to this link:
http://afrobarometer.org/data/geocoded-data

We used version 6 of the geocoded Afrobarometer data. Our data file is called afb_full_r6.xlsx and it should go in \data_processing\data\afrobarometer\source. 

We downloaded it from the Afrobarometer website in the fall of 2019. Conditional on an approved request for geocoded data from Afrobarometer, we will send you our source data.

For reference, here is a summary of the first few variables in our raw data:

	. sum;
	    Variable |        Obs        Mean    Std. Dev.       Min        Max
	-------------+---------------------------------------------------------
	      respno |          0
	     country |     35,804    25.57644    6.646718          1         36
	country_r5~t |     35,804    25.47551    6.575479          1         37
	countrybyr~n |     35,804    2.498324    1.089915          1          5
	      urbrur |     35,804    2.738493    22.70285          1        460

5). World Development Indicators (WDI):
------------------------------------------

The paper also reports on World Bank and UN data in tables 1 and 2.  We didn't do much to these data apart from gather them up in a spreadsheet.  This spreadsheet is in data/WDI.  This directory is not part of the data processing exercise described here. Note that there is no code associated with these data; this is just a spreadsheet.  It is the basis for table 1.


INSTRUCTIONS:
********************************

Processing is in two main blocks. One is organized by dofiles/readme_run.do, and one is organized by analysis/readme_run_analysis.do.  Following is a description of the code in /dofiles and /analysis.

/DOFILES:
------------------------------------------

The data for this paper is organized around the GHS data. We first unproject these data. This gives a grid of population data IN POLAR COORDINATES, where all cells are the same size measured in degrees.  We then, very carefully, convert this grid into a list, making sure that we can go back and forth between list order and grid coordinates.  We repeat this operation for a shape file of countries. This lets us add a country field to our big list of GHS data.  With this done, given survey data with latitude and longitude (i.e., polar coordinates), we can easily calculate the grid cell that should contain each observation, and hence the list item we should match it to.  Finally, we calculate mean population density in a neighborhood of each survey respondent.
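
The bookkeeping described above can be sketched in a few lines of Python. This is an illustration only, not the code in dofiles: the grid extent, the one-degree cell size, the row-major list order, the square neighborhood, and all function names are assumptions chosen for the example.

```python
import math

# Hypothetical grid definition. The real extent and cell size come from the
# unprojected GHS raster and are NOT taken from the repository.
WEST, NORTH = -180.0, 90.0  # upper-left corner of the grid, in degrees
CELL = 1.0                  # cell size in degrees (assumed)
NCOLS = int(360 / CELL)
NROWS = int(180 / CELL)


def cell_of(lat, lon):
    """Row and column of the cell containing a point (row 0 is northernmost)."""
    row = int(math.floor((NORTH - lat) / CELL))
    col = int(math.floor((lon - WEST) / CELL))
    if not (0 <= row < NROWS and 0 <= col < NCOLS):
        raise ValueError("point lies outside the grid")
    return row, col


def to_index(row, col):
    """Grid coordinates -> position in the row-major list."""
    return row * NCOLS + col


def to_cell(index):
    """Position in the list -> grid coordinates (inverse of to_index)."""
    return divmod(index, NCOLS)


def neighborhood_mean(values, row, col, radius=1):
    """Mean of the list values in a square window around a cell.

    The window is clipped at the grid edge and does not wrap around in
    longitude; both choices are simplifications made for this sketch.
    """
    total, count = 0.0, 0
    for r in range(max(0, row - radius), min(NROWS, row + radius + 1)):
        for c in range(max(0, col - radius), min(NCOLS, col + radius + 1)):
            total += values[to_index(r, c)]
            count += 1
    return total / count
```

The point of keeping to_index and to_cell as exact inverses is that a survey observation can be matched to a list item with cell_of followed by to_index, and any list item can be traced back to its place on the grid.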

All code is run from dofiles/readme_run.do.  This Stata dofile lists each of the subroutines involved in processing the data and will execute them as well. Steps 1-7 will run without confidential data. Almost every Stata subroutine keeps a log file with the same name. The date stamp on these files tells you when each program was last run. Many of the Python programs keep logs also, but this is less systematic and the logs are less informative.
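
To see at a glance when each step was last run, you can list the log files by date stamp. A minimal Python sketch follows; .log and .smcl are the usual Stata log formats, but which one the dofiles write is an assumption here, and log_timestamps is a hypothetical helper.

```python
import os
import time


def log_timestamps(folder, extensions=(".log", ".smcl")):
    """Return (file name, last-modified time) pairs for log files, oldest first."""
    rows = []
    for name in os.listdir(folder):
        if name.lower().endswith(extensions):
            modified = os.path.getmtime(os.path.join(folder, name))
            rows.append((name, modified))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # Lists the logs in the current directory; point this at dofiles instead.
    for name, modified in log_timestamps("."):
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(modified))
        print(stamp, name)
```

A log that is older than the logs of the steps before it is a hint that the step did not rerun.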

There are very few absolute paths in the programs, but most of the Stata programs contain a pointer to the ARCGIS Python implementation. There is an example at the top of readme_run.do.  You will need to change these to point to the corresponding Python executable for your system.  Don't bother trying to use a non-ARCGIS Python install; it won't work.
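
Before editing the pointers, it can help to confirm that the executable you intend to use really is the ARCGIS one. The Python sketch below starts a candidate interpreter and asks it to import arcpy, the ARCGIS Python package. The function name can_import and the placeholder path are assumptions for illustration; the sketch avoids newer library features so that it should run under either Python 2 or Python 3.

```python
import os
import subprocess


def can_import(python_exe, module="arcpy"):
    """True if python_exe starts and imports `module` without error."""
    try:
        with open(os.devnull, "w") as devnull:
            code = subprocess.call(
                [python_exe, "-c", "import " + module],
                stdout=devnull,
                stderr=devnull,
            )
    except OSError:
        # The path does not point at a runnable executable.
        return False
    return code == 0


if __name__ == "__main__":
    # Placeholder path; copy the actual pointer from the top of readme_run.do.
    exe = r"C:\path\to\ArcGIS\python.exe"
    print("arcpy available:", can_import(exe))
```

If this reports False for the path in your dofiles, the Python steps will fail no matter what else is set up correctly.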

readme_run.do will run everything. For debugging I recommend that you treat it as a table of contents and run the programs manually, in the order they are listed. This will be particularly helpful for the Python programs because this code is not good at managing errors in the Python routines. To do this, I often run the Python programs from an IDE.

I run all Stata programs by right clicking in an explorer window and then using the `execute' command in the context specific pulldown menu.  The directory that these programs start in matters because all file calls rely on relative paths. If you execute the Stata programs in this way, you should get the right starting point. Otherwise, you will probably need to set the correct starting directory at the top of each dofile.
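
A similar concern can arise when a Python program is launched from an IDE, because the IDE may start it in a different directory. If the Python programs also rely on relative paths (an assumption, not something checked here), a small context manager like the sketch below sets the starting directory for the duration of a run and then restores it. The name working_directory is hypothetical.

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch the current directory, restoring it afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the wrapped code raises, so the caller is never
        # left in the wrong directory.
        os.chdir(previous)
```

Usage is `with working_directory(some_folder):` followed by the code that opens files with relative paths.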

Most source and major output files are in the `data' directory. Several of the data subdirectories will have `source' and `generated' subdirectories.  The programs NEVER write to the source directories. Any file in the generated directories is generated by the programs.
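
If you modify the programs and want to confirm that the source directories are still untouched, one option is to fingerprint them before and after a run. The Python sketch below does this with SHA-256 digests; snapshot and changed are hypothetical helpers, not part of the repository.

```python
import hashlib
import os


def snapshot(folder):
    """Map each file's path (relative to folder) to a SHA-256 digest."""
    digests = {}
    for base, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(base, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MB chunks so large rasters do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            digests[os.path.relpath(path, folder)] = digest.hexdigest()
    return digests


def changed(before, after):
    """Relative paths added, removed, or modified between two snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Take one snapshot of a source directory before running readme_run.do and one after; changed should return an empty list.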

/ANALYSIS:
------------------------------------------

The programs in this directory use the output from readme_run.do to generate the figures and tables in the paper.

You can run all of the programs by executing analysis/readme_run_analysis.do.  This program calls a series of Stata programs. Only the first step, analysis of GHS data, will run without confidential data. Each Stata subroutine keeps a log file. This log will report supplementary output and will indicate errors.  The date stamp tells when each program was last run.

readme_run_analysis.do is basically a table of contents. You can run these programs in any order. There is a switch at the top that will turn off programs that rely on confidential data.

Some of the output from these programs takes the form of snippets of LaTeX tables. You will need to copy these into a working LaTeX document by hand to get them to compile.
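
As an alternative to pasting, a wrapper document can pull the snippets in with \input. The Python sketch below only builds the text of such a wrapper. It assumes each snippet is a complete table or tabular environment; if a snippet holds only table rows, you will still need to add the surrounding environment by hand. Any packages the snippets need are unknown here and must be supplied through preamble_lines. The function name wrapper_document is hypothetical.

```python
def wrapper_document(fragments, preamble_lines=()):
    """Return the text of a LaTeX document that inputs each fragment in turn."""
    lines = [r"\documentclass{article}"]
    lines.extend(preamble_lines)
    lines.append(r"\begin{document}")
    for path in fragments:
        # LaTeX accepts forward slashes in paths on all platforms.
        lines.append(r"\input{" + path + "}")
        lines.append(r"\clearpage")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(wrapper_document(["analysis/surveys_ab/graphs/ab_regs.tex"]))
```

Write the returned text to a .tex file at the top of the repository so that the relative paths in \input resolve.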

Here is a list of all output figures and table fragments. The program that generates each is listed above each set of output files:

Program: data_processing/analysis/surveys_ab/survey_ab_v6.do
-----------------------------------------
analysis/surveys_ab/graphs/Afrobarometer_l_ab_fear_walk_binsreg.pdf
analysis/surveys_ab/graphs/Afrobarometer_l_ab_fear_walk_binsreg_c.pdf
analysis/surveys_ab/graphs/ab_regs.tex

Program: data_processing/analysis/surveys_dhs/survey_dhs_v7.do
-----------------------------------------
analysis/surveys_dhs/graphs/DHS_l_diarrhea_unwt_binsreg.pdf
analysis/surveys_dhs/graphs/DHS_l_diarrhea_unwt_binsreg_c.pdf
analysis/surveys_dhs/graphs/DHS_l_improved_sanit_unwt_binsreg.pdf
analysis/surveys_dhs/graphs/DHS_l_improved_sanit_unwt_binsreg_c.pdf
analysis/surveys_dhs/graphs/DHS_l_obese_unwt_binsreg.pdf
analysis/surveys_dhs/graphs/DHS_l_obese_unwt_binsreg_c.pdf
analysis/surveys_dhs/graphs/DHS_l_school_8yr_unwt_binsreg.pdf
analysis/surveys_dhs/graphs/DHS_l_school_8yr_unwt_binsreg_c.pdf
analysis/surveys_dhs/graphs/dhs_regs.tex

Program: data_processing/analysis/surveys_lsms/survey_v6IND.do
-----------------------------------------
analysis/surveys_lsms/graphs/lsms_l_hrly_wage_binsreg.pdf
analysis/surveys_lsms/graphs/lsms_l_hrly_wage_binsreg_c.pdf
analysis/surveys_lsms/graphs/lsms_IND_regs.tex

Program: data_processing/analysis/surveys_lsms/survey_v6HH
-----------------------------------------
analysis/surveys_lsms/graphs/lsms_l_net_income_binsreg.pdf
analysis/surveys_lsms/graphs/lsms_l_net_income_binsreg_c.pdf
analysis/surveys_lsms/graphs/lsms_HH_regs.tex

Program: analysis/GHS/GHS_hist1_v4.do (figures 1AB,2)
-------------------------------------
analysis/GHS/graphs/allmono_cdf_v4.pdf
analysis/GHS/graphs/mono_cdf_area_v4.pdf
analysis/GHS/graphs/mono_ldensity_v4.pdf



REFERENCES:
********************************

Henderson, J. Vernon, Vivian Liu, Cong Peng and Adam Storeygard. 2020. Demographic and health outcomes by Degree of Urbanisation: Perspectives from a new classification of urban areas. European Commission.  

Henderson, J. Vernon and Matthew A. Turner. 2020. Urbanization in the developing world: too early or too slow? Journal of Economic Perspectives.


USAID (2020) Demographic and Health Surveys. https://dhsprogram.com/data/

World Bank (2020) Living Standards Measurement Study https://microdata.worldbank.org/index.php/catalog/lsms

Afrobarometer (2020) https://www.afrobarometer.org/online-data-analysis